Genomes: at the edge of chaos with maximum information capacity 
We propose an order index, $\phi$, which quantifies the notion of "life at the edge of chaos" when
applied to genome sequences. It maps genomes to a number from 0 (random and of infinite length)
to 1 (fully ordered) and applies regardless of sequence length. The 786 complete genomic sequences
in GenBank were found to have $\phi$ values in a very narrow range, 0.037±0.027. We show this implies
that genomes are halfway towards being completely random, namely, at the edge of chaos. We argue
that this narrow range represents the neighborhood of a fixed point in the space of sequences, and
that genomes are driven there by the dynamics of a robust, predominantly neutral evolution process.

PACS numbers: 87.14.Gg, 87.15.Cc, 02.50.-r, 05.45.-a, 89.70.+c, 87.23.Kg



The edge of chaos originally refers to the state of a computational system, such as a cellular
automaton, when it is close to a transition to chaos and gains the ability for complex
information processing [1, 2, 3]. The notion has since been used to describe biological states,
and life in general, on the assumption that life necessarily involves complex computation [4].
In model systems such as cellular automata, there are well-defined procedures for recognizing
the change in computational capability during the transition from non-chaotic to chaotic
states [1], Q. However, these have not been adapted to the wider biological context, even for
the simplest of organisms. But if we represent a living organism by its genome, view evolution
as a dynamical process that drives genomes in the space of sequences, and consider chaos as a
state of genome randomness, then we have a framework within which the meaning of "life occurs
at the edge of chaos" may be investigated. Genomes, linear sequences written in the four
chemical letters, or bases, A (adenine), C (cytosine), G (guanine) and T (thymine), and often
referred to as books of life, regulate the functioning of organisms through the many kinds of
codes embedded in them (there are also non-textual post-translational regulations; see,
e.g., [5]). When genomes are seen as texts, they have several key properties reflecting their
complexity, including long-range correlations and scale invariance [6, 7, 8] (although this
topic is debated [9]), self-similarity [10, 11, 12, 13], and distinctive Shannon redundancy
[14, 15, 16]. However, these properties do not give a measure of the proximity of a genome to
chaos or randomness. Before the edge-of-chaos notion can be explored, one needs a quantity that
measures the randomness of genomes as texts.

Here we analyze genomes in terms of the frequency of occurrence of $k$-letter words, called
$k$-mers, where $k$ is a small integer [17]. For a given $k$, the $4^k$ types of $k$-mers are
partitioned into $k+1$ "$m$-sets", $m = 0, 1, \ldots, k$. An $m$-set is composed of all the
$k$-mers containing $m$ and only $m$ A's or T's. There are $\tau_m = 2^k \binom{k}{m}$ types of
$k$-mers in an $m$-set.



The reason for partitioning the $k$-mers according to AT-content for statistical purposes is
that although the A:T and C:G ratios are invariably close to 1 [H, [20], the AT to GC ratio may
differ significantly. This partition is needed to prevent biased base composition from masking
crucial statistical information in genomes [9, 16].
For $k \ge 2$, the $k$th order index for a sequence of length $L$ (in bases) is

$$\phi = \frac{1}{2 - 2(p^k + q^k)} \sum_{m=0}^{k} \frac{\left| L_m - L_m^{(\infty)} \right|}{L} \qquad (1)$$

where $0 < p < 1$ is the fractional AT-content in the sequence; $q = 1 - p$; $L_m$ is the total
number of $k$-mers in the $m$-set; and $L_m^{(\infty)}$ is the expected value for $L_m$ in a
$p$-valued random sequence of infinite length: $L_m^{(\infty)} = L\, 2^{-k} \tau_m p^m q^{k-m}$.
The definition of $\phi$ is based on the observation that distribution-averages are useful
indicators of the randomness of a sequence. The denominator on the right-hand side of Eq. (1)
is a normalization factor which ensures $\phi \approx 1$ for an ordered sequence (in which all
AT's are on, say, the 5' end and all CG's are on the 3' end). The singularities at $p = 0$ and 1
are not a practical problem since no genome has such extreme base composition.
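As a rough illustration of Eq. (1), the following Python sketch computes the order index of a sequence for a single word length $k$. Several choices here are assumptions that the text does not specify: $k$-mers are read from overlapping windows, the window count ($L - k + 1$) stands in for $L$, $p$ is measured on the sequence itself, and the function name `order_index` is hypothetical.

```python
import random
from math import comb


def order_index(seq, k):
    """Order index of `seq` for word length k >= 2, following Eq. (1).

    Assumptions (not from the text): overlapping k-mer windows, the window
    count n = len(seq) - k + 1 used in place of L, and p measured on `seq`.
    Undefined when p is 0 or 1 (the singular cases noted in the text).
    """
    at = [base in "AT" for base in seq]      # True where the base is A or T
    p = sum(at) / len(seq)                   # fractional AT-content
    q = 1.0 - p
    n = len(seq) - k + 1                     # number of k-mer windows
    counts = [0] * (k + 1)                   # counts[m] plays the role of L_m
    m = sum(at[:k])                          # number of A/T in the first window
    counts[m] += 1
    for i in range(k, len(seq)):             # slide the window one base at a time
        m += at[i] - at[i - k]
        counts[m] += 1
    # Expected m-set sizes for a random sequence with the same p:
    # n * 2^-k * tau_m * p^m * q^(k-m), where tau_m = 2^k * C(k, m).
    expected = [n * comb(k, j) * p**j * q**(k - j) for j in range(k + 1)]
    deviation = sum(abs(c - e) for c, e in zip(counts, expected))
    return deviation / (n * (2.0 - 2.0 * (p**k + q**k)))


ordered = "A" * 5000 + "C" * 5000            # all AT's at one end, all CG's at the other
rng = random.Random(0)
shuffled = "".join(rng.choice("ACGT") for _ in range(100_000))

phi_ordered = order_index(ordered, 4)        # expected to be close to 1
phi_random = order_index(shuffled, 4)        # expected to be small, of order L ** -0.5
print(phi_ordered, phi_random)
```

The ordered test sequence mirrors the normalization argument in the text: nearly all windows fall into the $m = 0$ and $m = k$ sets, so the sum of deviations approaches the denominator.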

From the central limit theorem we expect, for random sequences, $|L_m - L_m^{(\infty)}|$ to
scale as $L_m^{1/2}$. We therefore expect $\phi$ to be proportional to $L^{-1/2}$ on average.
The log-log plots in Fig. 1(a) and (b) show $\phi$ as a function of sequence length for
different $k$'s and $p$'s. Each datum is averaged over 500 random sequences. It is seen that
$\phi$ scales very well as $L^{-1/2}$ (with sizable fluctuations), and is only weakly dependent
on $k$ and $p$. These results can be summarized for all $k$ and $p$ by an empirical relation:

$$\phi = c_\phi L^{-\gamma_\phi} \qquad (2)$$

with $\gamma_\phi = 0.50 \pm 0.01$ and $c_\phi = 1.0 \pm 0.2$ or, to a good approximation,
$\phi_{\rm ran} \approx L^{-1/2}$. This leads to the convenient concept of an equivalent length
for a $\phi$-valued sequence,
$L_{eq}(\phi) = \phi^{-2}$, the nominal length of a random sequence whose order index is $\phi$.

FIG. 1: (a) Log-log plot of order index, $\phi$, vs. length of random sequence for $p=0.5$ and
$k=2$ to 6. (b) Same as (a), for $k=4$ and $p=0.20$ to 0.50. (c) Semi-log plot of $\phi$ vs.
$N_\mu$, the number of random point mutations, for an initially ordered 20 Mb, $p=0.5$ sequence.
The intersection of the red lines is the critical point where the sequence becomes random.
(d) Same as (c); the initial sequence is the genome of E. coli.
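The equivalent length is a direct inversion of the approximate scaling $\phi \approx L^{-1/2}$. A minimal Python sketch, using the hypothetical helper name `equivalent_length` and the average genomic value 0.037 quoted in the abstract:

```python
def equivalent_length(phi):
    """Nominal length (in bases) of a random sequence whose order index is phi."""
    return phi ** -2


phi_genomic = 0.037                            # average genomic order index from the abstract
print(round(equivalent_length(phi_genomic)))   # about 730 bases
```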

Random events such as point mutations acting on a non-random sequence decrease its order, and
hence its $\phi$. Fig. 1(c) shows that the $\phi$ of a $p=0.5$, 20 Mb ordered sequence decreases
exponentially with the number of mutations $N_\mu$, until $N_\mu$ reaches a critical number
$N_{\mu c}$. The critical value reflects the fact that a random sequence does not become more
random with further changes. In other words, if one thinks of random point mutation as a
dynamical action taking a sequence from one point in the sequence space to another, then a
randomized sequence is a fixed point of the action. Our studies of initially ordered sequences
having a variety of lengths and base compositions yield

$$\phi \approx \begin{cases} \exp(-2N_\mu/L), & N_\mu < N_{\mu c}, \\ L^{-1/2}, & N_\mu \ge N_{\mu c}, \end{cases} \qquad (3)$$

where $N_{\mu c} \approx (1/4) L \ln L$, and the critical mutation rate is
$\mu_c = N_{\mu c}/L \approx (1/4) \ln L$. The formula for $N_{\mu c}$ compares well with
simulation. In the case of Fig. 1(c), the coordinates of the simulation ($k=4$) critical point
are ($\phi_c$, $N_{\mu c}$) = ($2.2 \times 10^{-4}$, $8.5 \times 10^7$), as compared to the
"theoretical" values ($2.2 \times 10^{-4}$, $8.4 \times 10^7$). For typical sequences of genomic
length ($L \sim 10^{1 \pm 1}$ Mb), $\mu_c = 4.0 \pm 0.6$ mutations per base (b$^{-1}$). We use
Eq. (3) to assign to a $\phi$-valued sequence an equivalent mutation rate,
$\mu_{eq}(\phi) = \ln \phi^{-1/2}$, the nominal number of random point mutations per base
required to bring the index of an ordered sequence to $\phi$.
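The critical quantities above follow from setting the exponential decay $\exp(-2N_\mu/L)$ equal to the random-sequence value $L^{-1/2}$. The Python sketch below evaluates them for the 20 Mb example; the function names are hypothetical and the formulas are the approximate ones quoted in the text.

```python
from math import log, sqrt


def critical_mutations(length):
    """N_mu_c ~ (1/4) L ln L: mutations needed to randomize an ordered sequence."""
    return 0.25 * length * log(length)


def critical_rate(length):
    """mu_c = N_mu_c / L ~ (1/4) ln L, in mutations per base."""
    return 0.25 * log(length)


def equivalent_rate(phi):
    """mu_eq(phi) = ln(phi ** -0.5), from inverting phi = exp(-2 N_mu / L)."""
    return -0.5 * log(phi)


L = 20_000_000                                   # the 20 Mb ordered sequence of Fig. 1(c)
print(critical_mutations(L))                     # roughly 8.4e7
print(1 / sqrt(L))                               # phi at the critical point, roughly 2.2e-4
print(critical_rate(1e6), critical_rate(1e8))    # roughly 3.5 and 4.6 per base
```

The last line reproduces the quoted range of about 4.0±0.6 mutations per base for sequences between 1 Mb and 100 Mb.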

Eq. (3) can be adapted for application to sequences not initially ordered. For example, the
equivalent mutation rate for the 4.6 Mb genome of E. coli ($\phi = 0.049$) is 1.5 b$^{-1}$.
Since $\mu_c = 3.8$ b$^{-1}$ for a 4.6 Mb sequence, one expects that an additional
$2.3 \times 4.6 \times 10^6 = 1.1 \times 10^7$ mutations are needed to randomize it. In the
simulation shown in Fig. 1(d), the actual number needed is found to be
$(1.1 \pm 0.1) \times 10^7$.
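The E. coli arithmetic can be retraced in a few lines of Python. This is only a check of the quoted numbers under the approximate formulas $\mu_{eq} = \ln \phi^{-1/2}$ and $\mu_c = (1/4) \ln L$; the variable names are illustrative.

```python
from math import log

length = 4.6e6                               # E. coli genome length in bases
phi = 0.049                                  # its order index, as quoted in the text

mu_eq = -0.5 * log(phi)                      # equivalent mutation rate, about 1.5 per base
mu_c = 0.25 * log(length)                    # critical mutation rate, about 3.8 per base
extra_mutations = (mu_c - mu_eq) * length    # additional mutations, about 1.1e7

print(f"{mu_eq:.1f} {mu_c:.1f} {extra_mutations:.1e}")
```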



We computed $\phi$ for 384 complete prokaryotic genomes (28 archaebacteria and 356 eubacteria)
and 402 complete chromosomes from 28 eukaryotes, of lengths ranging from 200 kb to 230 Mb. The
rice genome was downloaded from the Rice Annotation Project Database [21], and all other
sequences from the National Center for Biotechnology Information genome database [22], during
the period 26 Feb. to 27 Nov., 2006. The 28 eukaryotes (number of chromosomes and genome length
in parentheses) include
11 fungi, A. fumigatus (8, 28.8 Mb), C. albicans (1, 0.95 
Mb), C. glabrata (13, 12.3 Mb), C. neoformans (14, 19.1 
Mb), D. hansenii (7, 12.2 Mb), E. cuniculi (11, 2.50 Mb), 
E. gossypii (7, 8.74 Mb), K. lactis (6, 10.7 Mb), S. cere- 
visiae (Yeast) (16, 12.1 Mb), S. pombe (Fission Yeast) (3, 
10.0 Mb), Y. lipolytica (6, 20.5 Mb); the unicellular P. 
falciparum (Malaria) (14, 22.9 Mb); 2 plants, A. thaliana 
(Mustard) (5, 119 Mb), O. sativa (Rice) (12, 372 Mb); 5 
insects, C. elegans (Worm) (6, 100 Mb), D. melanogaster 
(Fly) (6, 118 Mb), A. gambiae (Mosquito) (5, 223 Mb), 
A. mellifera (Bee) (16, 183 Mb), T. castaneum (Beetle) 
(10, 112 Mb); 9 vertebrates, D. rerio (Zebrafish) (25, 
1.04 Gb), G. gallus (Chicken) (30, 933 Mb), B. taurus 
(Cow) (30, 1.41 Gb), C. familiaris (Dog) (39, 2.31 Gb), 
M. musculus (Mouse) (21, 2.57 Gb), R. norvegicus (Rat) 
(21, 2.50 Gb), M. mulatta (Monkey) (21, 2.73 Gb), P. 
troglodytes (Chimpanzee) (25, 2.86 Gb), H. sapiens (Hu- 
man) (24, 2.87 Gb). 

The results shown in Fig. 2 indicate that genomic $\phi$'s vary systematically neither with
sequence length ((a) and (b)) nor with base composition ((c)). Instead they have a nearly
universal value: the average over all sequences is $\phi_g = 0.037 \pm 0.027$ (this defines the
symbol $\phi_g$). We have verified that, as a general rule, within a genome the variation in
segmental $\phi$ decreases with segmental length and the average $\phi$ reaches its whole-genome
value when the size of the segment exceeds 50 kb. In Fig. 2(a) the spread in $\phi$ of the
genomic data shows a tendency to decrease with sequence length. Part of this effect may be
purely statistical: smaller sample sizes (i.e., sequence lengths) tend to have larger
statistical fluctuations. Part of it may also be because sequences longer than 10 Mb are all
from chromosomes of multicellular eukaryotes that are phylogenetically close. In any case
Fig. 2(a) clearly puts the genomes in a category apart from random sequences.

FIG. 2: (a) Order index, $\phi$, vs. sequence length, $L$, for 384 prokaryotic genomes (gray
symbols), 402 eukaryotic chromosomes (black symbols), and random sequences (line of symbols).
In the box plots, the box height (gray for prokaryotes; black for eukaryotes) is given by the
25% to 50% values and the range represents the 10% to 90% values; numbers above boxes are
numbers of sequences in the group; all $\phi$'s are averaged over $k=2$ to 6. (b) $\phi$ vs.
$\log L$. (c) $\phi$ vs. fractional AT-content, $p$. (d) Ratio of $\phi_{cd}$ (for coding parts)
to $\phi_{ncd}$ (noncoding) vs. $\log L$. (e) Ratio of $\phi_{mRNA}$ (mRNA segments) to
$\phi_{nmRNA}$ (non-mRNA), averaged over classes of eukaryotes. (f) Ratio of equivalent mutation
rate, $\mu_{eq}$, to critical mutation rate, $\mu_c$, vs. $\log L$.

From each complete sequence, we extracted the coding and noncoding parts (owing to imperfect
annotation, the sum of the parts sometimes differs slightly from the whole), then concatenated
the parts into two separate sequences and computed their order indexes, $\phi_{cd}$ and
$\phi_{ncd}$, respectively. A summary of the ratio $\phi_{cd}/\phi_{ncd}$ for sets of genomes
grouped by length is given in Fig. 2(d). For prokaryotes the ratio ranges (10th to 90th
percentile) from 0.15 to 3 with a median of about 0.5. Notable exceptions are the three bacteria
with exceptionally large genomes ($L<10$ Mb) with ratios ranging from 5 to 7: S. avermitilis,
S. coelicolor, and Mycobacterium sp. MCS.

For the eukaryotic chromosomes longer than 10 Mb the ratios do not significantly deviate from
unity. Mustard, whose coding and noncoding parts have nearly equal lengths (~10-12 Mb), is the
only exception in this category, with $\phi_{cd}/\phi_{ncd} \sim 7$ (these ratios are beyond the
90th percentile and therefore are not included in Fig. 2(d)). In this case
$\phi_{cd} \sim 0.056$ is similar to other genomes while $\phi_{ncd} \sim 0.0075$ is about seven
times less than the norm. Rice, the only other plant included in this study, with
$\phi_{cd}/\phi_{ncd} \sim 0.35$, is unlike mustard but more like the other eukaryotes. For the
eukaryotic chromosomes shorter than 10 Mb the ratios average to about 2 but show greater
variation.

The coding parts of eukaryotic genomes are further partitioned into mRNA and non-mRNA parts,
and their $\phi$'s computed separately. Averaged over sets of organisms,
$\phi_{mRNA}/\phi_{nmRNA}$ is of the order of 1, with the ratio being ~0.5 for insects and ~2
for plants (Fig. 2(e)). For the latter, the ratio is ~1 for the five chromosomes of mustard and
~2 for the twelve chromosomes of rice. In summary, the differences in $\phi$ between coding and
noncoding parts, and between mRNA and non-mRNA parts, are much smaller than the difference
between genomes and random sequences.

The ratio $\mu_{eq}(\phi)/\mu_c$ is an indication of how close a sequence is to being random.
Fig. 2(f) shows that the shorter ($L<10$ Mb) sequences are roughly halfway, and the longer
sequences one-third of the way, towards becoming random. The systematic but weak
length-dependence of the ratio is explained by the fact that the genomic $\phi$, hence
$\mu_{eq}(\phi)$, is approximately constant, whereas $\mu_c$ is proportional to $\ln L$. The
overall average of the ratio is 0.45±0.11.
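The weak length dependence can be made concrete with a short Python sketch. It assumes a constant equivalent mutation rate of 1.8 per base (the typical genomic value quoted elsewhere in this paper) and the approximate formula $\mu_c = (1/4) \ln L$; the function name `randomness_ratio` is hypothetical.

```python
from math import log

TYPICAL_MU_EQ = 1.8   # typical genomic equivalent mutation rate (per base), from the text


def randomness_ratio(length, mu_eq=TYPICAL_MU_EQ):
    """Ratio mu_eq / mu_c for a sequence of `length` bases, with mu_c = (1/4) ln L."""
    return mu_eq / (0.25 * log(length))


for length in (1e6, 1e7, 1e8):
    # The ratio falls slowly as length grows, because mu_c grows only as ln L.
    print(f"{length:.0e} bases: ratio = {randomness_ratio(length):.2f}")
```

With a fixed $\mu_{eq}$, a hundredfold increase in length changes the ratio only from roughly one half to about 0.4.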

We summarize our results by considering the function

$$I(z) = -z \ln z - (1 - z) \ln(1 - z) \qquad (4)$$

where $z = \phi^\lambda$ and $\lambda = 0.21$. The value of the exponent $\lambda$ is determined
by requiring that $z = 0.5$ at $\phi = \phi_g$. $I(z)$ is the simplest function that maps the
range (0,1) to a positive real value, has zeros at (and only at) $z = 0$ and 1, has a maximum at
$z = 0.5$ and is symmetric with respect to the point $z = 0.5$. In Fig. 3 the parabola-like
curve shows $I(z)$ plotted against $z$. In addition, three other sets of abscissas are given:
$\phi$; $\log_{10} L_{eq}(\phi)$, where $L_{eq}(\phi)$ is the equivalent length (Eq. (2)); and
$\mu_{eq}(\phi)$, the equivalent mutation rate. A scale linear in $z$, relative to one in
$\phi$, is a better representation of the space of possible sequence lengths. It is seen in
Fig. 3 that genomes are concentrated near the peak of the $I$-curve and are equally far removed
from the random ($z \sim 0$) and ordered ($z \sim 1$) sequences.

FIG. 3: The function $I(z)$ (Eq. (4)) plotted as a function of: $z = \phi^\lambda$
($\lambda = 0.21$); $\phi$; $\log_{10} L_{eq}(\phi)$ (Eq. (2)); $\mu_{eq}(\phi)$ (in units of
b$^{-1}$; Eq. (3)). Data from prokaryotic (gray) and eukaryotic (black) genomes occur near the
peak of $I$ and have $\phi \sim \phi_g$, $L_{eq} \sim 0.25$ to 10 kb, and
$\mu_{eq} \sim 1.8 \pm 0.5$ b$^{-1}$.
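The choice of exponent can be checked numerically. The Python sketch below solves $\phi_g^\lambda = 0.5$ for $\lambda$ and evaluates $I(z)$ at the genomic point; the names `lam` and `information_function` are illustrative only.

```python
from math import log

phi_g = 0.037                            # average genomic order index from the text
lam = log(0.5) / log(phi_g)              # exponent chosen so that phi_g ** lam == 0.5


def information_function(z):
    """I(z) = -z ln z - (1 - z) ln(1 - z) for 0 < z < 1, as in Eq. (4)."""
    return -z * log(z) - (1.0 - z) * log(1.0 - z)


z_g = phi_g ** lam
print(round(lam, 2))                     # 0.21
print(z_g, information_function(z_g))    # z is 0.5, where I(z) peaks at ln 2
```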

The genomic equivalent lengths, occupying a small neighborhood around
$L_{eq}(\phi_g) = 730$ b, are far shorter than the actual lengths of complete sequences. Among
the many possible mechanisms that may cause long sequences to have short equivalent lengths, by
far the simplest is replication. This is because a long sequence of length $L$ composed of
multiple replications of a random sequence $l$ bases long will have $L_{eq} \sim l$,
independent of $L$. Similarly, if genome growth is dominated by random segmental duplication
[23, 24, 25], then the genomic $L_{eq}$ will be much shorter than the genome length [lq ].
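The replication argument can be illustrated with a small simulation: a random unit is copied many times and the equivalent length of the result is compared with that of the unit. This is a sketch under assumptions not fixed by the text (overlapping $k$-mer windows, window count in place of $L$, $p$ measured on the sequence, averaging over $k = 2$ to 6); the function names are hypothetical.

```python
import random
from math import comb


def order_index(seq, k):
    """Order index for word length k under the assumptions listed above."""
    at = [base in "AT" for base in seq]      # True where the base is A or T
    p = sum(at) / len(seq)
    q = 1.0 - p
    n = len(seq) - k + 1                     # number of overlapping k-mer windows
    counts = [0] * (k + 1)
    m = sum(at[:k])
    counts[m] += 1
    for i in range(k, len(seq)):
        m += at[i] - at[i - k]
        counts[m] += 1
    expected = [n * comb(k, j) * p**j * q**(k - j) for j in range(k + 1)]
    deviation = sum(abs(c - e) for c, e in zip(counts, expected))
    return deviation / (n * (2.0 - 2.0 * (p**k + q**k)))


def mean_order_index(seq, ks=range(2, 7)):
    """Average of the order index over k = 2..6."""
    return sum(order_index(seq, k) for k in ks) / len(ks)


rng = random.Random(1)
unit = "".join(rng.choice("ACGT") for _ in range(2000))   # random 2 kb unit
replicated = unit * 50                                    # 100 kb built from 50 copies

l_eq_unit = mean_order_index(unit) ** -2
l_eq_replicated = mean_order_index(replicated) ** -2
print(len(replicated), round(l_eq_unit), round(l_eq_replicated))
```

Because copying multiplies every $m$-set count by nearly the same factor, the order index of the replicated sequence stays close to that of the unit, so its equivalent length remains of the order of the unit length rather than the full 100 kb.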

The genomic equivalent mutation rates span a small range around 1.8 b$^{-1}$, or about 45% of
the critical mutation rate of approximately 4 b$^{-1}$ that would randomize the genomes. Thus,
for example, a typical worm (C. elegans) chromosome, with an average length of 17 Mb and an
equivalent mutation rate of 1.8 b$^{-1}$, is as random as an initially ordered 17 Mb sequence
after having undergone 31 million random mutations, as compared to the 68 million mutations
which would randomize the sequence. In this sense genomes are quasi-random, or "at the edge of
chaos". For a linear text, quasi-randomness satisfies two crucial necessary conditions for high
information content: high efficiency and large variation in word usage. A random sequence has
maximum word-usage efficiency because all its $k$-mers in an $m$-set have occurrence
frequencies very close to the theoretical mean frequency of the set,
$f_m^{(\infty)} = L_m^{(\infty)}/\tau_m$. However, this also implies minimum word-usage
variation, which prevents a random sequence from being information-rich. In a quasi-random
sequence a compromise between high efficiency and large variation in word usage is obtained by
suitably relaxing the equal-frequency condition [16], thus allowing a genome at the edge of
chaos to have close to maximum information capacity.

The high concentration of genomic $\phi$'s near $\phi_g$ may be interpreted as the signature of
certain robust characteristics in the genomic evolution processes. The near equality of
$\phi$'s for coding and noncoding regions within a genome suggests that the underlying evolution
processes are not dominated by codon selection, but are likely predominantly selectively
neutral [26, 27]. We therefore propose the following conjecture: Just as randomness is a fixed
point of the action of random point mutations, the state of genomes defined by
$\phi \approx \phi_g$ is a fixed point of the action of a robust, predominantly neutral
evolution process. The observed shortness of $L_{eq}(\phi_g)$ suggests that the neutral process
is dominated by (non-deleterious) random segmental duplications [23, 24, 25], occurring singly
[iH [HI] and in tandem [29]. We consider random segmental duplication to be an
infrastructure-building process because it does not necessarily produce information directly.
Instead, it causes genomic $\phi$ to be close to $\phi_g$, giving genomes maximum information
capacity. Since this enhances genomic fitness indirectly, the neutral process may in itself be
a product of natural selection. The near randomness of the neutral process guarantees the fixed
point associated with $\phi_g$ to have a very large configuration space, hence relatively low
free energy, thus rendering $\phi_g$-valued states widely accessible. In contrast, non-neutral,
information-gathering processes dominated by selection (narrowly construed) are predominantly
point mutations: they are poor mechanisms for inducing genomic states of maximum capacity, and
do not lead to widely accessible states. Taken together these suggest that the evolution of the
genome may have been driven by a two-stage process: one neutral, robust,
infrastructure-building and universal, and the other selective, fine-tuning,
information-gathering and diverse. An example of such a two-step process is found in the
paradigm of accidental gene duplication followed by mutation-driven subfunctionalization
[30, 31]. We may assume that during the long history of the genome's growth and evolution, the
twin processes acted in a ratchet-like, complementary manner, driving the genome, in successive
stages, to a state of maximum information capacity, and helping it to acquire, at each stage,
near-maximum information content.